Serial No. 10/671,889 

Docket No. YOR920030170US1 (YOR.464) 

AMENDMENTS TO THE SPECIFICATION: 

In the latest Office Action and during the telephone interview dated May 13, 2009, the 
Examiner indicated that various of the recently-revised paragraphs below, previously added 
by Applicants' previous amendment, raise the issue of new matter. 

Although Applicants continue to disagree with the Examiner's position, in an effort to 
expedite prosecution, the five paragraphs, previously incorporated by reference from these 
co-pending applications, are now further revised, as follows. Support for the amendments 
below is described elsewhere in this paper. The following sentences are intended to be 
inserted immediately preceding the subtitle "Level 3 Prefetching of Kernel Routines" on page 
12 of the specification: 

The present invention includes using data stored in non-standard format, including, 
more particularly, the non-standard format described in co-pending application 10/671,888, 
referred to herein as "register block" format. 

The present invention also is directed to Single Instruction, Multiple Data (SIMD) 
machines, where k > 1 indicates a number of data capable of being simultaneously moved in 
a single instruction. Thus, in the example described in the third of the above-identified co- 
pending applications, wherein the register block format was demonstrated using a 2-by-2 
block, referred to therein as a "pseudo-matrix", k = 4. 

The register block data format exemplarily used in the present invention involves 
blocks of matrix data of size p-by-q where p and q are small integers so that the pieces of 
these blocks can be fitted into the registers of a particular architecture to achieve a desirable 
data format stored in these registers. The layout of these blocks is arbitrary. In usual cases, 
the p-by-q sub-blocks will be laid out either in row- or column-major format. But a key idea 
is that the arbitrary layout of these blocks is tailored to the architectural design of the FPU 
and its associated floating point registers. 
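By way of illustration only, the following minimal sketch shows one possible way of repacking a matrix into contiguous p-by-q sub-blocks, each laid out in either row- or column-major order. The function name pack_register_blocks, the parameter inner, the ordering of the sub-blocks relative to one another, and the requirement that p and q evenly divide the matrix dimensions are all assumptions made for this example; the sketch does not purport to reproduce the register block format of co-pending application 10/671,888.

```python
def pack_register_blocks(rows, p, q, inner="col"):
    """Repack a matrix (given as a list of rows) into contiguous p-by-q blocks.

    Hypothetical illustration: blocks are emitted one block column at a
    time, and the elements inside each block are stored in column-major
    order (inner="col") or row-major order (inner="row").
    """
    m, n = len(rows), len(rows[0])
    if m % p != 0 or n % q != 0:
        raise ValueError("this sketch assumes p divides M and q divides N")
    if inner not in ("col", "row"):
        raise ValueError("inner must be 'col' or 'row'")

    packed = []
    for bj in range(0, n, q):        # first column of the current block
        for bi in range(0, m, p):    # first row of the current block
            if inner == "col":
                # Column-major inside the block: walk down each column.
                for j in range(bj, bj + q):
                    for i in range(bi, bi + p):
                        packed.append(rows[i][j])
            else:
                # Row-major inside the block: walk across each row.
                for i in range(bi, bi + p):
                    for j in range(bj, bj + q):
                        packed.append(rows[i][j])
    return packed


# Example: a 4-by-4 matrix whose element (i, j) is encoded as 10*i + j.
example = [[10 * i + j for j in range(4)] for i in range(4)]
blocks = pack_register_blocks(example, 2, 2, inner="col")
```

In this example each 2-by-2 sub-block occupies four consecutive storage locations, so that a sub-block can be moved as one unit of four data.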

All modern programming languages (C, Fortran, etc.) store matrices in two- 
dimensional arrays. However, this layout can be proved to be one dimensional. That is, let 
matrix A have M rows and N columns. The standard column major format of A is as follows. 

Each of the N columns of A is stored as a contiguous vector (stride 1). Each of the M 
rows of A is stored with consecutive elements separated by LDA (Leading Dimension of A) 
storage locations (stride LDA). Let A(0,0) be stored in memory location a. The matrix 
element A(i,j) is stored in memory location a + i + LDA*j. It is important to note here that 
stride 1 is optimal for memory accesses and that stride LDA is poor. Also, almost all level 3 
linear algebra code treats rows and columns about equally. 
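The addressing rule above can be illustrated by the following minimal sketch, in which a flat list stands in for memory and the base location a is taken to be offset 0. The function names and the sample values are assumptions introduced solely for this example.

```python
def column_major_offset(i, j, lda):
    """Offset of A(i, j) from the base location a (zero-based indices)."""
    return i + lda * j


def build_column_major(rows, lda):
    """Store a matrix (given as a list of rows) in a flat column-major buffer."""
    m, n = len(rows), len(rows[0])
    if lda < m:
        raise ValueError("LDA must be at least the number of rows M")
    buf = [0.0] * (lda * n)
    for j in range(n):
        for i in range(m):
            buf[column_major_offset(i, j, lda)] = rows[i][j]
    return buf


def column_offsets(j, m, lda):
    """Offsets visited when walking down column j: consecutive (stride 1)."""
    return [column_major_offset(i, j, lda) for i in range(m)]


def row_offsets(i, n, lda):
    """Offsets visited when walking along row i: separated by LDA."""
    return [column_major_offset(i, j, lda) for j in range(n)]


# Example: M = 2 rows, N = 3 columns, LDA = 4 (LDA may exceed M).
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
buf = build_column_major(A, lda=4)
```

With M = 2, N = 3 and LDA = 4, walking down column 1 visits offsets 4 and 5, whereas walking along row 1 visits offsets 1, 5 and 9, which shows the stride 1 and stride LDA access patterns, respectively.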